Incorporating global dynamics to improve the accuracy of disease models: Example of a COVID-19 SIR model

Mathematical models of infectious diseases exhibit robust dynamics, such as stable endemic, disease-free equilibriums or convergence of the solutions to periodic epidemic waves. The present work shows that the accuracy of such dynamics can be significantly improved by including global effects of host movements in disease models. To demonstrate improved accuracy, we extended a standard Susceptible-Infected-Recovered (SIR) model by incorporating the global dynamics of the COVID-19 pandemic. The extended SIR model assumes three possibilities for susceptible individuals traveling outside of their community: • They can return to the community without any exposure to the infection. • They can be exposed and develop symptoms after returning to the community. • They can be tested positively during the trip and remain quarantined until fully recovered. To examine the predictive accuracy of the extended SIR model, we studied the prevalence of the COVID-19 infection in six randomly selected cities and states in the United States: Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona. The extended SIR model was parameterized using a two-step model-fitting algorithm. The extended SIR model significantly outperformed the standard SIR model and revealed oscillatory behaviors with an increasing trend of infected individuals. In conclusion, the analytics and predictive accuracy of disease models can be significantly improved by incorporating the global dynamics of the infection.

reduce the spread of COVID-19, businesses, communities, and governments have implemented different control measures such as mandatory lockdowns, social distancing, avoiding crowded events, using face masks in public, and vaccination [3]. Nevertheless, control of COVID-19 remains a major issue in several parts of the world [4].
COVID-19 is mainly transmitted from human-to-human via direct contact with contaminated surfaces and through the inhalation of respiratory droplets from infected individuals [5]. About 97% of the infected individuals will recover after period ranging between one to four weeks. Therefore, the use of mathematical modeling seems to be an appropriate approach to study COVID-19 transmission dynamics.
Mathematical modeling of infectious diseases has increasingly become an essential tool for prevention, prediction, and control of infectious diseases [6][7][8]. Since 1760, when Daniel Bernoulli developed the first disease model of smallpox, numerous mathematical models have been utilized to study disease transmission dynamics, and to predict, assess, and control infectious diseases [9][10][11][12].The substance of mathematical modeling lies in formulating a set of mathematical equations that mimic reality [13]. Mathematical models have been evolved from small sets of ordinary differential equations to sophisticated compartmental models with several equations (see [14][15][16] for a review).
One of the simplest, yet powerful, disease models is the standard Susceptible-Infected-Recovered (SIR) model, which was first introduced by Kermack and McKendrick in a series of three papers [17][18][19]. In a standard SIR model, the host population is divided into susceptible, infected and recovered individuals, denoted by S(t), I(t) and R(t), respectively. These quantities track the numbers of individuals in each compartment over different time periods [20,21]. The standard SIR model without birth and death is represented by the set of ordinary differential equations [22]: Where β is the average number of susceptible individuals infected by one infectious individual per contact per unit of time (the transmission rate), and γ is the average number of infected individuals recovered per unit of time (recovery rate). For decades, the standard SIR model has been extended to various forms by adding different compartments to suit the biological, spatio-temporal and social aspects of the disease dynamics or to study the impact of intervention strategies on the disease transmission dynamics in different communities [23,24]. For instance, it has been extended to SIR models with diffusion [25], contaminated environment [26,27], delay terms [28], several strains of infection [29], and multiple routes of infection [30].
Recently, several researchers utilized mathematical modeling to analyze, and predict the transmission dynamics of COVID-19 pandemic [31][32][33]. Dynamics of COVID-19 epidemic has been simulated using different versions of SIR or SEIR (susceptible, exposed, infected and recovered) models [31,32]. The main modification include adding asymptomatic and symptomatic infection compartments [33], hospitalization compartment [34], and quarantined and isolated compartments [31]. These models are presumably able to predict and simulate the number of infected cases by taking into consideration the asymptomatic and symptomatic cases, deaths, needs of beds in hospitals, and effect of control measures and the interventions to decrease the number of cases.
The abovementioned extended SIR models contribute to the existing literature. However, they largely ignore the effects of global dynamics of infection on local communities. The presence of a global pandemic or a widespread infection can largely influence the dynamics of infection in a local community. Most communities are well-connected and the assumption that the disease exists only within the community is invalid [35].
There have been attempts to include the global dynamics in different SIR models [28,36]. Nevertheless, such extended SIR models have several unknown parameters and poorly fit to data of host population. Due to lack flexibility and poor fitness to data, there is a need for develop SIR models that are more practical.
Moreover, regardless of the parameter values, most numerical simulations of SIR models are limited to three distinct dynamics. The first of these dynamics is the solution curve of infected individuals may exhibit an epidemic wave before converging to a disease-free equilibrium [37], secondly, the solution curve of infected converge to an endemic equilibrium [38], the third of these dynamics is the solution curve of infected converge to periodic epidemic waves [39]. For the standard SIR model (1), the dynamics are even more limited. Namely, the solution curves always represent the same qualitative dynamics: an epidemic wave of the infectious population, an inverted S shape for susceptible population, and S shape for the recovered population. Regardless of the set of parameter values and initial conditions, such qualitative behaviors will always remain the same (see panels Fig 1A-1C). State [40], or city level [41], shows that the dynamics of COVID-19 is more complicated than a single epidemic wave. Therefore, it is essential to include the global dynamics of infection in a disease model.
The present work aims to address this issue. We extend the SIR model to a new model that includes the global impacts of the infection and is also capable of fitting well to infectious disease data. To incorporate the global effect and test the predictive accuracies of the extended model, we selected randomly six cities and states influenced by the COVID-19 global pandemic: Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona.
We assumed three possibilities to consider in the SIR modeling of COVID-19: susceptible individuals from a local community can travel in and out of their community without any exposure to COVID-19, they can be exposed to COVID-19 while traveling and develop symptoms after they return to their community, or they can be diagnosed with COVID-19 during their traveling and return to their community after recovery. In the next section we include all these three possibilities in the extended SIR model.
The rest of the paper is organized as follows. In section. 2, we introduce the extended SIR model and its formulation. We explain the methodology that was used to fit the extended SIR model to COVID-19 data. In section. 3, we present the results of our analysis based on the six cities and states. In section. 4, we provide a discussion of the main results and additional factors to consider in the modeling process.

Data
The COVID-19 data used in this study were obtained from the health department of Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona [42][43][44][45]. The data were dated from March 10, 2020, to March 7, 2021(a total of 363 days). We did not include data after March 7, 2021, because our model does not include the effects of vaccination and it would be inappropriate to include the data thereafter. Specifically, the data variables consisted of date, total number of cases, new cases, total deaths, new deaths, and total number of individuals tested for COVID-19.
We used abovementioned data to extract the daily number of recovered, susceptible, and infected individuals (see S1 File for the algorithms used for generating the data). Table 1 provides the basic descriptive statistics of Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona of daily COVID-19 data of infected individuals.
Observe that the statistics of susceptible and recovered data are comparable (see S1 File for the basic descriptive statistics). However, the statistics of infected individuals are at a much Table 1 [46,47].

. Descriptive Statistics of Kansas City (KC), Saint Louis (SL), San Francisco (SF), Missouri (MO), Illinois (IL), and Arizona (AZ) daily COVID-19 data from
To estimate the number of susceptible individuals, we assumed an average incubation period of 5 days for COVID-19 [48]. We also considered one day for obtaining the COVID-19 test results. Hence, all of those who were tested positive were susceptible from the beginning until 6 days prior to obtaining the test results. Also, we added the individuals who take the test, but their results were negative. These individuals had presumably high risk of getting infected and therefore susceptible. The number of infected individuals were calculated by considering an average infection period 14 days [49]. Hence, we cumulatively added of new cases for 14 days until they recovered.

Model formulation
We divided our population of N individuals living in a local community into sup-populations (i.e., compartments) of susceptible compartment S(t), infected compartment I(t), and recovered compartment R(t). As shown in Fig 2, the extended SIR model of COVID-19 transmission assumes three possibilities for susceptible individuals traveling outside of the community: They can return to the community without any exposure (the net rate is f(t) = f 2 (t)-f 1 (t)), they can be exposed COVID-19 and develop symptoms after returning to the community (the inflow rate of g(t)), or they can be tested positive during their trip and remain quarantined until fully recovered and thereafter return to the community (the inflow rate of h(t)). The extended SIR model is formulated by the following system of deterministic non-linear differential equations and where β and γ are the same parameters as in system (1). Functions f(t), g(t) and h(t) are differentiable and bounded functions and take into account the global effects of the infection. To avoid overfitting, our goal is to estimate f, g and h using least complicated forms. Although adding exposed population can provide interesting dynamics, we decided to exclude the exposed compartment from our modeling. This is due to lack of data associated with exposed population. Namely, there is no known method of accurately identify the time series of exposed population in a community. In addition, the more compartments are added the harder it becomes to accurately estimate the parameter values. In some cases, the confidence intervals of estimated parameter values become extremely large due to high number of parameters and insufficient amount of data. With this rationale in mind, we therefore employ a two-step method to estimate the parameters of the model.
Individuals can get infected both within and outside of the community. A standard SIR model has been considered for progression of infection within the community. The extended SIR model coupled with a global SEI model where S G , E G , and I G are the global of susceptible, exposed, and infected compartments respectively, and S L , I L , and R L are the susceptible, infected, and recovered compartments in the local community, respectively. The extended model assumes three possibilities for susceptible individuals traveling outside of the community: They can return to the community without any exposure (the net rate is f(t) = f 2 (t)-f 1 (t)), they can be exposed to the infection and develop symptoms after returning to the community (rate of g(t)), or they can be tested positive during their trip and remain quarantined until fully recovered and thereafter return to the community (rate of h(t)).

Model fitting
The single-step numerical methods such as linearization and discretization [50] to estimate the parameter values of model (2) fail to converge, due to high degrees of freedom and unknown intervals of parameter estimations. We therefore proposed a two-step process for parameters estimation of model (2). First, we estimated the parameter values of the standard SIR model (1) and then we determined the functions f(t), g(t) and h(t) using the residual data of S(t), I(t) and R(t) subpopulations. As mentioned before, we used the COVID-19 data of susceptible, infected, and recovered individuals in Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona for an epidemic period starting from March 10, 2020, to March 7, 2021.
In the first step, we numerically solved the system (2) using the MATLAB ode45 solver which is based on the fourth order Runge-Kutta method. The stability of the method is well established in [51]. For data fitting, the optimization function "fmincon" was used along with the common technique of the least-squares method [52,53]. This method minimizes the sum of the squared residuals, that is, the difference between model predictions and their corresponding data values. The sum of the squared residuals is calculated using the formula below where M represents the total number of data points considered for fitting and y and y i represent the values predicted by the model and those from the data, respectively. We estimated the SIR parameter values by considering the following factors. Several studies indicated that the COVID-19 transmission rate of infection was 0.5 [54,55]. Hence, we set β = 0.5. Also, some studies assumed the average recovery period (i.e 1/ γ) is about 7 days [54,55], which results in the initial value of γ = 0.13. Also, to be consistent with the data, we set our initial conditions to the number of susceptible, infected and recovered at = 1. The estimated model parameters are provided in Table 2.
In the second step, we fitted the global effect of infection on the community by estimating functions f(t), g(t) and h(t) in model (2).
Although model selection can be done using Akaike's Information Criteria (AIC) and Bayesian Information Criteria (BIC) methods, we used MATLAB curve fitting toolbox to measure the goodness of fit (adjusted R 2 , sum of the squared residuals, etc.) to find the optimal forms of functions f(t), g(t) and h(t). Specifically, the model fitting resulted in the following forms:

3-Results
Using the COVID-19 data of each city and states, we estimated the functions corresponding to the global effects f(t), g(t) and h(t) in model (2) using the abovementioned two steps in Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona. The estimated net rate for the susceptible individuals who can return to the community without any exposure is given by f(t) = λ 1 t + λ 2 . The estimated net rate of the individuals who exposed to the infection and develop symptoms after returning to the community is given by g(t) = a 1 b 1 cos(b 1 T+c 1 )+ a 2 b 2 cos(b 2 T +c 2 )+ a 3 b 3 cos(b 3 T+c 3 )., and the estimated net rate of the individuals who tested positive during their trip and remain quarantined until fully recovered and thereafter return to the community is given by h(t) = p 1 t + p 2. All parameter estimations of step 1 and step 2 are summarized in Tables 2 and 3, respectively. From Table 2, we noticed that the transmission rate in the states; Missouri, Illinois, and Arizona are much higher than the cities; Kansas City, Saint Louis, and San Francisco, which is due to the larger population size of states and cities. Note that the estimated values for recovery rates are different than those of the standard SIR model because of the recovery rate is influenced by the global effects. In Table 3, there are three parameters, a, b and c where a represents the amplitude of each wave for each city or state, b represents the frequency of each wave, and c represents the phase shift of each wave. Also, the value of T represents the period of each wave. For instance, the state of Missouri and the city of Kansas City have almost the same of three periodic epidemic waves of COVID-19. One happens every two years, the other epidemic wave every four months and half, and the last wave every nine months. The periodic waves could be associated with several factors such Table 3.  as, traveling, major social events, infection prevention policies, changes to the coronavirus itself, and the increase of people who become susceptible because they have not developed some immunity. [56]. The panel of Fig 3A-3F shows that the extended SIR model of the susceptible solution curve fits well to the data of susceptible subpopulation in the cities and states. Likewise, S7 Similarly, the panel of Fig 4A-4F shows the extended SIR model in the subpopulations of recovered individuals fits well to the data and have a goodness of fit R 2 = 0.9873, 0.9893, 0.9804, 0.9844, 0.9748, 0.9606, respectively (See S7 Table in S1 File).

Estimated parameter values of the global functions f(t), g(t) and h(t) based on data of Kansas City (KC), Saint Louis (SL), San Francisco (SF), Missouri (MO), Illinois (IL), and Arizona (AZ
The panel of Fig 5A-5F shows that the dynamics of COVID-19 in the cities and states are more complicated than a single epidemic wave as common in the standard SIR model. The extended model is capable of revealing the underlying behaviors hidden in the data. Also, the extended SIR model in the subpopulations of infected individuals fits well to the data and have goodness of fit R 2 = 0.9542, 0.9287, 0.962, 0.9438, 0.9015, 0.9492 for Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona data, respectively. (See S7 Table in S1 File). Therefore, the inclusion of global effects to the SIR model can substantially improve the predictive accuracy of the model.

4-Discussion
The present work highlights the importance of including global dynamics of infection in disease models to achieve higher predictive accuracies. We introduced a two-step algorithm for accurate estimation of infection parameters by considering both global and local effects of the infection spread in SIR models. The first step leads to estimation of local parameters (i.e., the transmission and recovery rates, β and, respectively) whereas the second step incorporates the global effects of the infection (i.e., estimation of functions f(t), g(t) and h(t)). To test the methodology, we applied the two-step model fitting algorithm to the extended SIR model (2) using Kansas City, Saint Louis, San Francisco, Missouri, Illinois, and Arizona data from March 10, 2020, to March 7, 2021. As shown in the panels of Figs 3-5, the two-step method resulted in solution curves that fit well to the COVID-19 data. The goodness of fit becomes more apparent when it is compared to that of the standard SIR model. Therefore, we compared the standard SIR model with the extended SIR model using the first 212 Kansas City COVID-19 data. As shown in Fig 6A and 6B, the solution curves of the standard SIR model poorly fit to the COVID-19. Moreover, Table 4 shows the comparisons of model fitness for the standard and extended SIR model. The extended SIR model of the susceptible solution curve has R 2 = 0.9905 while in standard model R 2 = 0.1551. Similarly the extended SIR model outperformed the standard SIR model in the subpopulations of recovered individuals (R 2 = 0.9912 versus R 2 = 0.47), and the subpopulation of infected individuals (R 2 = 0.7083 versus R 2 = -258.65). Note that the negative R 2 value is because the classical SIR model does not follow the trend of the data.
In addition to higher predictive accuracies of the extended SIR model (2), the solution curves revealed oscillatory behaviors with an increasing trend of infected individuals. This contrasts with the standard SIR model, where regardless of chosen parameter values, the solution curves always exhibit the same qualitative behaviors (see Fig 1A-1C).
Although the standard SIR has been proven useful to study local dynamics of various infections, it fails to capture the global effects of a widespread disease. The failure of standard SIR model to forecast the COVID-19 pandemic can be described by a variety of factors. One of these factors is that the standard SIR model assumes the population is closed, isolated, and ignores the effects of the global dynamics of infection on neighboring communities which is not a valid assumption [35]. Hence, by including the global infection effects in the disease models, we can identify underlying mechanisms governing the dynamics of infectious diseases.
In spite of several studies indicate that include temperature factor in the SIR model could potentially improve the model outcomes [57], recent studies suggested that the temperature has no influence on the propagation of the COVID-19 virus [58,59]. In fact, some strains of the virus alter depending on their environments. They may live and grow in a variety of geographical areas or temperatures. Outside of laboratory tests, there is no way to anticipate how the virus would react in heat and humidity or even cold and dry temperatures.
The inability of the standard SIR model to fit there COVID-19 data has been identified by other researchers [60]. Nonetheless, the presence of a breakpoint due to strong policy  interventions, mentioned in [60], does not necessarily reduce the prevalence of infection in a community dealing with a pandemic.
In conclusion, including the global dynamics of the infection and applying the two-step model fitting algorithm can enable us to extract vital information (e.g., presence of epidemic waves) from the data.